Abstract

We apply random matrix theory to compare correlation matrix estimators C obtained
from emerging market data. The correlation matrices are constructed from
10 years of daily data for stocks listed on the Johannesburg Stock Exchange (JSE)
from January 1993 to December 2002. We test the spectral properties of C against
random matrix predictions and find some agreement between the distributions of
eigenvalues, nearest neighbour spacings, distributions of eigenvector components
and the inverse participation ratios for eigenvectors. We show that interpolating
both missing data and illiquid trading days with a zero-order hold increases agreement
with RMT predictions. For the more realistic estimation of correlations in
an emerging market, we suggest a pairwise measured-data correlation matrix. For
the data set used, this approach suggests greater temporal stability for the leading
eigenvectors. An interpretation of eigenvectors in terms of trading strategies is
given, as opposed to classification by economic sectors.

1 Introduction



Correlation matrices are common to problems involving complex interactions and the 
extraction of information from series of measured data. Our aim is to determine empirical 
correlations in price fluctuations of daily sampled price data of distinct shares in a reliable 
way. Our investigation is based on 10 years of daily data for 250-350 traded shares listed 
on the JSE Main Board from January 1993 to December 2002.

There are several aspects to the question of how to calculate correlations in financial time
series. In particular, missing data and thin trading (no price changes for a stock over
several time periods) may be significant. Random correlations in price changes are likely
to arise in an ensemble of several shares. Furthermore, for a portfolio of N distinct assets,
there will be $N(N-1)/2$ entries in a correlation matrix which has been determined from
time series of length L. When L is not large, the calculated covariance matrix may be
dominated by measurement noise. Hence, it is necessary to understand the effects of (i) noise,
(ii) finiteness of time series, (iii) missing data and (iv) thin trading in the determination of
empirical correlations.

The properties of random matrices first became known with Wigner's seminal work in the
1950s for application in nuclear physics in the study of the statistical behaviour of neutron
resonances and other complex systems of interactions ([23], [5] and [14]). More recently
random matrix theory has been applied to calibrate and reduce the effects of noise in
financial time series and to investigate constraints on rational (empirically based) decision
making (cf. [13], [17], [15], [28], [8], [26], [10], [9]). Correlation matrices are computed
for the data under investigation and quantities associated with these matrices may be
compared to those of random matrices. The extent to which properties of the correlation
matrices deviate from random matrix predictions clarifies the status of the information
derived from the computation of covariances. In several studies of shares traded in the
S&P 500 and DAX, it was found that, aside from a small number of leading eigenvalues,
the eigenvalue spectra for the measured data coincide with theoretical random matrix
predictions, i.e. it was found that the estimation of covariances is dominated by random
noise. In [24], a model for the correlations was postulated which explained the observed
spectral properties. RMT has also been shown to yield an improved estimation technique: an
estimated correlation matrix can be filtered by removing the contributions of eigenvalues
which lie in the RMT range. In [25] it is shown that noise levels in the correlation matrix
depend on the ratio N : L, where N denotes the number of stocks and L denotes the
length of the time series.



1.1 Correlation matrices and missing data in an emerging market 

In this paper we consider the problems of missing data and thin trading in the determination
of empirical correlations in daily sampled price fluctuations. We analyze the database
containing the prices $S_i(t)$ of assets $i = 1, \ldots, N$ at time $t$ as follows. We first find
the change in asset prices

$$r_i(t) = \ln S_i(t + \Delta t) - \ln S_i(t).$$



The usual cross-correlation matrix for idealized data (non-zero price fluctuations and no
missing data) is given by

$$C_{ij} := \frac{\langle r_i r_j \rangle - \langle r_i \rangle \langle r_j \rangle}{\sigma_i \sigma_j},$$

where $\langle \cdots \rangle$ denotes the average over the period studied and
$\sigma_i^2 := \langle r_i^2 \rangle - \langle r_i \rangle^2$ is the variance of
the price changes of asset $i$. Alternatively one could write

$$C_{ij} = \frac{1}{L} \sum_{t=1}^{L} R_i(t) R_j(t),$$

where $L$ denotes the uniform length of the time series and $R_i(t)$ denotes the price change
of asset $i$ at time $t$ such that the average values of the $R_i$'s have been subtracted off and
the $R_i$'s are rescaled so that they all have constant volatility $\sigma_i^2 := \langle R_i^2 \rangle = 1$. This is
written as $C = \frac{1}{L} M M^T$, where $M$ is an $N \times L$ matrix and $M^T$ is its transpose (cf. [17]).

The pairwise measured-data cross-correlation matrix (using the pairwise deletion method
[33], [34]) for the case when there is missing data in the time series of returns is computed as
follows:

$$C^P_{ij} := \frac{\langle \rho_i \rho_j \rangle - \langle \rho_i \rangle \langle \rho_j \rangle}{\sigma_i \sigma_j},$$

where $\rho_i$ and $\rho_j$ denote subseries of $r_i$ and $r_j$ such that there exists measured data for
both $\rho_i$ and $\rho_j$ at every time period in the subseries, $\langle \cdots \rangle$ denotes the average over the period
studied, and $\sigma_i^2 := \langle \rho_i^2 \rangle - \langle \rho_i \rangle^2$ is the variance of the price changes of asset $i$.
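The pairwise deletion estimate described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes returns are stored as an array with one column per share and NaN marking missing measurements, and the function name `pairwise_correlation` is hypothetical. Means and variances are taken over the matched subseries only, and the number of matched observations per pair is returned alongside the correlations.

```python
import numpy as np

def pairwise_correlation(returns):
    """Pairwise-deletion correlation for an (L, N) array with NaN for missing data.

    Returns (corr, counts), where counts[i, j] is the number of periods in which
    both series i and j were measured (the normalisation used for that pair).
    """
    r = np.asarray(returns, dtype=float)
    n = r.shape[1]
    observed = ~np.isnan(r)
    corr = np.full((n, n), np.nan)
    counts = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i, n):
            both = observed[:, i] & observed[:, j]   # periods measured for both shares
            counts[i, j] = counts[j, i] = both.sum()
            if counts[i, j] < 2:
                continue                             # too few matched points
            x, y = r[both, i], r[both, j]
            sx, sy = x.std(), y.std()                # population std of the subseries
            if sx == 0.0 or sy == 0.0:
                continue                             # correlation undefined
            c = ((x * y).mean() - x.mean() * y.mean()) / (sx * sy)
            corr[i, j] = corr[j, i] = c
    return corr, counts

# Small example: the two series overlap on the first three periods only.
data = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan], [np.nan, 10.0]])
corr, counts = pairwise_correlation(data)
```

Because each entry is estimated from a different subset of periods, a matrix built this way is not guaranteed to be positive semi-definite.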



2 Random Matrix Theory (RMT) predictions 



We summarise four known universal properties of random matrices, namely the Wishart 
distribution for eigenvalues, the Wigner surmise for eigenvalue spacing, the distribution 
of eigenvector components and the inverse participation ratio for eigenvector components, 
which will be applied in our analysis. 

Let $A$ denote an $N \times L$ matrix whose entries are i.i.d. random variables which are normally
distributed with zero mean and unit variance. As $N, L \to \infty$ while $Q = L/N$ is
kept fixed, the probability density function for the eigenvalues of the Wishart matrix (or
Laguerre ensemble) $R = \frac{1}{L} A A^T$ is given by ([1], [11], [31]):

$$\rho(\lambda) = \frac{Q}{2\pi} \frac{\sqrt{(\lambda_{max} - \lambda)(\lambda - \lambda_{min})}}{\lambda} \qquad (1)$$

for $\lambda$ such that $\lambda_{min} \le \lambda \le \lambda_{max}$, where $\lambda_{min}$ and $\lambda_{max}$ satisfy

$$\lambda_{max/min} = 1 + \frac{1}{Q} \pm 2\sqrt{\frac{1}{Q}}. \qquad (2)$$
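As a numerical check on Eqns. 1 and 2, the following sketch evaluates the noise-band edges and the density for given N and L. The function names `mp_bounds` and `mp_density` are hypothetical. With N = 1000 and L = 6448 it gives an upper edge of about 1.94, and with N = 319 and L = 1305 (the Case 3 values for epoch 6 in Table 1) about 2.23; both agree with the values of the upper edge quoted in Section 3.3.

```python
import numpy as np

def mp_bounds(n_assets, n_obs):
    """Edges (lambda_min, lambda_max) of the RMT eigenvalue band, Eqn. 2."""
    q = n_obs / n_assets
    centre = 1.0 + 1.0 / q
    half_width = 2.0 * np.sqrt(1.0 / q)
    return centre - half_width, centre + half_width

def mp_density(lam, n_assets, n_obs):
    """Eigenvalue density of Eqn. 1; zero outside the band."""
    q = n_obs / n_assets
    lam_min, lam_max = mp_bounds(n_assets, n_obs)
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    rho = np.zeros_like(lam)
    inside = (lam > lam_min) & (lam < lam_max)
    x = lam[inside]
    rho[inside] = q / (2.0 * np.pi) * np.sqrt((lam_max - x) * (x - lam_min)) / x
    return rho

lo, hi = mp_bounds(319, 1305)
grid = np.linspace(lo, hi, 200001)
# Riemann-sum check that the density integrates to (approximately) one.
total_mass = mp_density(grid, 319, 1305).sum() * (grid[1] - grid[0])
```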

The distribution of eigenvalue spacings was introduced as a further test for the case
when the empirical eigenvalue distribution does not deviate significantly from the RMT
prediction. The so-called Wigner surmise for eigenvalue spacings [23] is given by

$$P(s) = \frac{\pi s}{2} \exp\left(-\frac{\pi s^2}{4}\right), \qquad (3)$$

where $s = (\lambda_{i+1} - \lambda_i)/d$ and $d$ denotes the average of the differences $\lambda_{i+1} - \lambda_i$ as $i$ varies.^2
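The spacing statistic of Eqn. 3 can be sketched as below. This is an illustrative simplification under the definition given in the text: spacings are normalised by their global mean, whereas the analysis in this paper uses unfolded eigenvalues. The function names are hypothetical.

```python
import numpy as np

def normalised_spacings(eigenvalues):
    """Nearest-neighbour spacings s = (lambda_{i+1} - lambda_i) / d, with d the mean spacing."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))
    diffs = np.diff(lam)
    return diffs / diffs.mean()

def wigner_surmise(s):
    """Wigner surmise P(s) of Eqn. 3."""
    s = np.asarray(s, dtype=float)
    return 0.5 * np.pi * s * np.exp(-0.25 * np.pi * s**2)

spacings = normalised_spacings([1.0, 2.0, 4.0, 7.0])

# Riemann-sum check that P(s) is normalised and has unit mean spacing.
s_grid = np.arange(0.0, 10.0, 0.001)
mass = wigner_surmise(s_grid).sum() * 0.001
mean_spacing = (s_grid * wigner_surmise(s_grid)).sum() * 0.001
```

A histogram of `normalised_spacings` applied to the eigenvalues inside the noise band can then be compared with `wigner_surmise` evaluated on the same grid.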

It has been found that the eigenvector components $u_i^\alpha$, for $\alpha = 1, \ldots, N$, of an eigenvector
$u^\alpha$ are normally distributed with zero mean and unit variance [14], [4],

$$P(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right). \qquad (4)$$

The inverse participation ratio (IPR) is used to analyze the structure of the eigenvectors
of the correlation matrix [28]. The $i$th component $u_i^\alpha$ of $u^\alpha$ corresponds to the contribution
of the $i$th time series to that eigenvector. To quantify this contribution, the IPR for $u^\alpha$ is
defined as

$$I_\alpha = \sum_{i=1}^{N} (u_i^\alpha)^4, \qquad (5)$$

where $N$ is the number of time series (the number of shares) and, hence, the number of
eigenvector components. If the components of the eigenvector are identical, $u_i^\alpha = 1/\sqrt{N}$, then
$I_\alpha = 1/N$; if there is only a single non-zero component, then $I_\alpha = 1$. In general, the IPR is
the reciprocal of the number of eigenvector components which contribute significantly,
i.e. which are different from zero. It is found that $E[I_\alpha] = 3/N$ since the kurtosis for the
distribution of eigenvector components is 3.
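The two limiting cases of Eqn. 5 can be verified with a short sketch. It assumes eigenvectors are stored as the columns of an array and normalises each to unit length before summing fourth powers; the function name is hypothetical.

```python
import numpy as np

def inverse_participation_ratio(vectors):
    """IPR of Eqn. 5 for each column of `vectors` (or for a single 1-D vector)."""
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=0, keepdims=True)  # enforce unit length
    return np.sum(v**4, axis=0)

n = 10
uniform = np.ones(n)                 # all shares contribute equally
localised = np.zeros(n)
localised[3] = 1.0                   # a single share contributes

ipr_uniform = inverse_participation_ratio(uniform)      # equals 1/N
ipr_localised = inverse_participation_ratio(localised)  # equals 1
```

For eigenvectors of a random symmetric Gaussian matrix, the average of `N * IPR` over all eigenvectors comes out close to 3, in line with the stated expectation.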



3 Analysis of Johannesburg Stock Exchange data 



The JSE is one of the 20 largest national stock markets in the world. We summarise
some of its known qualitative features. Although many of the main board JSE shares
are illiquid, the market as a whole is a fairly liquid one. There is share concentration in
a half-dozen shares: these dominant shares account for almost a third of the index and have
a large bias towards resources. The resources sector in turn is strongly correlated with
the dollar-rand exchange rate, an exogenous factor that has a dominant influence on price
dynamics in South African stock markets. Next it is noteworthy that different shares are
listed on the JSE at different times and, hence, different shares do not always trade on the
same day. However, some shares which do not trade often may occasionally trade in large
volumes for several days. These realities exacerbate the problem of estimating correlations
in a reliable way.

^2 Unfolded eigenvalues are used in practice [5], [14], [28].

The data set used in this study incorporated a zero-order hold for prices when there was no
trading. This approach accounts for sequences of zero-valued returns in the return time
series even though no measurements occurred. While it has often been convenient to set the
returns to zero in the periods preceding listing of shares to avoid data holes, in general, this
strategy seems to give rise to a significant Gaussian component in estimated correlations.
We investigate the effect of various treatments and interpretations of measurements in
the context of price time series. The approach here favours the notions that (1) if no
price was discovered for a given share then there was no measurement, and (2) share
cross-correlations can only be computed when there are measurements on the same day.
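To illustrate how a zero-order hold produces zero-valued returns without any new measurement, the following minimal sketch carries the last observed price forward over gaps and then computes log returns. Marking unobserved prices as NaN is an assumed convention, and the function name `zero_order_hold` is hypothetical.

```python
import numpy as np

def zero_order_hold(prices):
    """Carry the last observed price forward over missing values (NaN)."""
    held = np.asarray(prices, dtype=float).copy()
    for t in range(1, len(held)):
        if np.isnan(held[t]):
            held[t] = held[t - 1]   # remains NaN if no price has been observed yet
    return held

# One share: not yet listed on day 0, traded on days 1 and 4 only.
prices = np.array([np.nan, 10.0, np.nan, np.nan, 11.0])
held = zero_order_hold(prices)
returns = np.diff(np.log(held))     # log returns of the held price series
```

Here the two interior returns are exactly zero although no trade took place, while the return spanning the pre-listing day stays undefined unless it is also padded with zero.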

3.1 Filtering and partitioning the data

The data set of 10 years of data from 1 January 1993 to 31 December 2002 was split 
into annual epochs. The data was windowed to create 6 sets of 5 years of daily price 
data. Each block was screened to remove shares that were de-listed or which traded quite 
infrequently. For each year in a given epoch of 5 years, this was achieved by dropping all 
shares that neither recorded price measurements at year-end nor traded at least once in 
the preceding month. Table 1 gives the data sets used. 

In Figure 1 we reconstruct price indices (not total price) from the market capitalization
of each individual stock based on the economic sector membership, i.e. a weight in a
particular index would be the stock's market capitalization divided by the total portfolio
market capitalization. The reason for this is that there is no complete constituent history
available for the full 10 year period studied - the indices were reconstructed by the
authors. This also ensures consistency between the indices provided and the
stock data used in the study. Figure 1 corroborates evidence of negative correlation which
is presented when we consider temporal stability of our results in Subsection 3.4 below
- the dominance of the financial sector peaks in 1998 for the period under investigation;
thereafter the resources sector begins to dominate the market.

3.2 Three estimates of cross-correlations

Fig. 1. Historical context of price evolution for the six 5-year data epochs studied: Indices are
reconstructed using the market capitalization of stocks in the industrial, financial and resources
sectors as well as the entire market for the market index. The horizontal lines labelled EP#1,
EP#2 to EP#6 demarcate time windows 1993-1997, 1994-1998 to 1998-2002, respectively. Specific
dates are highlighted by vertical lines labelled: D#1 - 27 Apr 1994 - the first SA elections of the
post-Apartheid era, D#2 - 17 Aug 1997 - Russian GKO default, D#3 - 10 Apr 2000 - proxy
date for Nasdaq crash, D#4 - 20 Dec 2001 - SA Rand (ZAR) crash, D#5 - 27 Jul 2002 -
Sarbanes-Oxley Act. Inset: The ZAR/USD exchange rate evolution from Jan 1998 to Dec 2002.

We investigate correlation structure by considering correlation matrices of the data sets
in Table 1 in three different ways. Case 1: we assign the value of zero whenever there is
no measured data for a return $r_i(t)$ for asset $i$ at time $t$; we then compute the correlation
matrix in the usual way as described in the introduction. Case 2: we compute the pairwise
measured-data correlation matrix to overcome missing measurements. Case 3: we address
the problem of no trading, i.e. zero price fluctuations for several time periods in succession.
To do so, in the event of 2 or more successive zero-valued price fluctuations we delete the
measured return value $r_i(t) = 0$, effectively turning the zero-valued information into
missing data. This compensates for interpolated prices being mistaken for measurements.
We then compute the pairwise measured-data correlation matrix.
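The Case 3 treatment can be sketched for a single return series as follows. This is one reading of the rule stated above, labelled as an assumption: every zero that belongs to a run of at least two consecutive zero returns is replaced by NaN (missing), while isolated zeros are kept as measurements. The function name `mask_zero_runs` and the `min_run` parameter are hypothetical.

```python
import numpy as np

def mask_zero_runs(returns, min_run=2):
    """Replace zeros in runs of >= min_run consecutive zero returns by NaN."""
    r = np.asarray(returns, dtype=float).copy()
    is_zero = (r == 0.0)
    # Pad with False so that runs touching either end are detected.
    padded = np.concatenate(([False], is_zero, [False])).astype(int)
    changes = np.diff(padded)
    starts = np.flatnonzero(changes == 1)    # index where each zero run begins
    ends = np.flatnonzero(changes == -1)     # index one past where each run ends
    for start, end in zip(starts, ends):
        if end - start >= min_run:
            r[start:end] = np.nan            # treat as missing, not measured
    return r

series = np.array([0.1, 0.0, 0.0, 0.2, 0.0, -0.1, 0.0, 0.0, 0.0])
masked = mask_zero_runs(series)
```

The masked series can then be passed, column by column, to a pairwise-deletion correlation estimate.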

We considered the problem of non-positive definiteness (this property is destroyed by 
most missing-data methods, including the one which we implement) by applying the 
area-minimizing algorithm of [7] to make the matrices in Case 3 positive semi-definite 
(see also [19]). Lastly, as a further case, we removed bias from the data as a means of
removing the market mode (cf. [28]). Details for these cases are not included since they
do not add to our discussion of the phenomenology of missing data in this paper.

We note that there are several other methods for treating missing data (see for example
[33] and [34]). Our view is that the pairwise deletion method offers a sufficiently robust
correlation estimate for the purposes of this analysis of daily data. Listwise deletion of an
entire day's records for a day on which there is missing data for a single stock is likely
to delete useful data for the remaining stocks and results in too few remaining records. Mean
substitution and imputation by regression are likely to introduce spurious correlations
between stocks which are not listed or which do not trade for several successive periods -
this is borne out by the comparison of results between Cases 2 and 3. For the application
of hot deck imputation it is not at all clear which strings of return data it would be
appropriate to draw from over the varying time windows. The most promising alternative
missing data method would be the use of an expectation maximization algorithm and this
is left for further investigation.

Fig. 2. Missing data is depicted in red, while reasonable data is white space. The three graphs
are, from left to right: Case 1, which includes zero-padding and zero-order hold; Case 2,
which has only zero-order hold; and Case 3, which has measured data only. These demonstrate
the extent of missing data. The graph reflects daily data from Epoch 6 (1998-2002).
[Panels: Epoch #6, Case 1; Epoch #6, Case 2; Epoch #6, Case 3. Horizontal axis: Stocks.]

Fig. 3. Frequency distributions of normalisations used in the computation of pairwise correlations
are plotted as a histogram for each Case considered in Figure 2. See also Table 1.
[Panels: Epoch #6, Case 1; Epoch #6, Case 2; Epoch #6, Case 3. Horizontal axis: Normalization.]

The different treatments of the data have significant impact on the relationship between
the number of meaningful data points and the normalisation factors used in the computations.

Figure 2 illustrates the occurrence of missing data. In Case 1, zero padding fills all the
gaps so there appears to be no missing data. In Case 2, there is missing data for shares
which were not listed at the start of the epoch and constant prices are recorded when no
trading has taken place. In Case 3, we remove prices when no trading has occurred. Stocks
are ordered by market capitalization from left to right. The concentration of red on the
right of panel (c) is clear evidence that the smallest-cap stocks tend to trade quite infrequently.
Horizontal bands indicate public holidays. That a zero-order hold was used to interpolate
missing data on public holidays within the data set obtained is a prime example of how
measurement error can contaminate a database.

Figure 3 plots the frequency of the normalisation factors used to compute entries in the
correlation matrices. For Cases 2 and 3, incorrect normalisation factors for (truncated)
pairwise matched time series would have the effect of distorting correlation estimates when
there is missing data. In Case 1, the normalisation factor is the same for all pairs, i.e. it is
equal to $L \approx 1305$, the total number of official trading days in the 5-yr epoch (see Table 1).
In Case 2, the normalisation factor is often much larger than the number of days on which
shares actually traded. Here, normalisations of the order of L were frequently used
even though there was substantially less measured price data. In Case 3, normalisation
factors varied from less than 60 up to L, with normalisations of 330-390 occurring most
frequently.

Table 1
The data sets used, comprising shares traded on the Johannesburg Stock Exchange. Each data
set starts on 1 January of the starting year and ends on 31 December of the ending year.

        Data set (Epoch)           1       2       3       4       5       6
        Start date (1 Jan)         1993    1994    1995    1996    1997    1998
        End date (31 Dec)          1997    1998    1999    2000    2001    2002
        Total no. of shares (N)    253     296     321     330     336     341
        No. of trading days (L)    1304    1305    1306    1306    1305    1305

Case 1  No. of shares used         253     296     321     330     336     341
        % zero returns             73%     70%     63%     58%     53%     54%
        % missing data

Case 2  No. of shares used         253     296     321     330     336     341
        % zero returns             54%     44%     41%     40%     42%     46%
        % missing data             18%     24%     23%     18%     13%     8%

Case 3  No. of shares used         244     282     308     310     316     319
        % zero returns             12%     11%     12%     14%     14%     15%
        % missing data             59%     56%     50%     42%     38%     36%



3.3 Spectral properties and comparisons with RMT predictions 

In [17], investigating daily price fluctuations for N=406 stocks of the S&P 500 for L=1309
days during 1991-1996, with Q = 3.22, it was found that the leading eigenvalue was
~25 times greater than the RMT predicted $\lambda_{max}$. Adjusting for the total variance $\sigma^2$
of the price fluctuations, it was found that 94% of the spectrum could be attributed
to random noise. In [27], the high-frequency TAQ database published by NYSE for the
period 1994-1995 was analysed: using 30 min returns for N=1000 companies with L=6448,
it was found again that the leading eigenvalue was ~25 times larger than the RMT
predicted value $\lambda_{max} = 1.94$ and that ~98% of the eigenvalues could be accounted for
as random noise effects. Results of [27] were corroborated in the more extensive study [28]
of the same high-frequency TAQ data together with CRSP databases of daily data for
common stocks in the NYSE beginning 1925, the AMEX beginning 1962 and the NASDAQ
beginning 1972. Investigation of further universal properties confirmed that, for eigenvalues
within the Wishart range and their corresponding eigenvectors: (a) the distribution of
nearest-neighbour spacings was in good agreement with Gaussian Orthogonal Ensemble
predictions, (b) the distribution of eigenvector components conformed with the predicted
Gaussian distribution with zero mean and unit variance and (c) almost all the eigenvector
components contributed equally to the inverse participation ratio except for eigenvectors
corresponding to eigenvalues outside the RMT bounds. In the latter cases it was found
that almost all stocks participated in the largest eigenvector while for the remaining large
eigenvectors there was localization, i.e. only a few stocks contributed to them. Similar
analysis was conducted on high-frequency data for the DAX for the period Nov 1997 to
Dec 1998 to examine intraday dynamics and memory effects in the index [9].

Fig. 4. Daily price returns for JSE main board shares for years 1998-2002 are used to investigate
eigenvalue structures of three estimated correlation matrices. Figure 4 (a) shows the eigenvalue
density functions with the distinct eigenvalues greater than the maximum RMT predicted value
for the same Q-factor as the sample. Insets: plots of the Wishart distribution (Eqn. 1) are
superimposed on plots of the small eigenvalues. Figure 4 (b) shows the nearest-neighbour
distributions of the unfolded eigenvalues. Superimposed on these are plots of the Wigner surmise
(Eqn. 3). The unfolded eigenvalues were computed using Gaussian broadening and numerical
integration.
[Panels: (a) Eigenvalue PDF; (b) Nearest Neighbour (unfolded).]

In this section we investigate the same properties in the context of an emerging market. 
Figure 4 gives a comparison of the eigenvalue densities and nearest-neighbour spacings for 
the three different cases considered for the last epoch, 1998-2002. It is clear from the graph 
that most of the eigenvalues for Case 1 are within the range of the Wishart distribution 
(Eqn. 1). For Case 2, the number of eigenvalues within the noise range is slightly reduced. 
Some of the eigenvalues are negative in this case. For Case 3 there is a more significant 
drop in number of eigenvalues in the noise band compared to Case 1; there are also more 
negative eigenvalues in Case 3 compared to Case 2. In all Cases the nearest-neighbour 
spacings indicate some agreement with the Wigner surmise (Eqn. 3). 

It is clear from the presence of negative eigenvalues in Cases 2 and 3 that the matrices
obtained are no longer positive definite. There are several algorithms to obtain positive
definite matrices from a non-positive definite estimate [7], [19]. A thorough comparison of
these methods (including their impact on eigenvalue distributions and temporal stability)
is a separate topic of investigation.

For Case 3 we found that ~88% of the eigenvalues (including negative values) fell below
$\lambda_{max} = 2.23$ (a smaller percentage than in [17], [27] and [28]) and that the largest
eigenvalue, $\lambda = 21.20$, was ~9.5 times greater than $\lambda_{max}$ (significantly less than results
for developed markets [17], [27] and [28]). The high percentage of eigenvalues below $\lambda_{max}$
may be attributed to the fact that many of the less liquid stocks behave independently
relative to the rest of the market. While a null-hypothesis of Gaussian returns is useful
for identifying how zero-padding and zero-order hold add noise to data, it is possible that
the noise range $[\lambda_{min}, \lambda_{max}]$ is wider than suggested by Eqn. 1. Simulations for
time series with Gaussian returns populated appropriately with missing data yield almost
identical distributions of eigenvalues to the Wishart distribution. Different stocks in the
SA market exhibit a range of return phenomenologies, including periodic and aperiodic
behaviour [32]; hence, the construction of a representative null-hypothesis becomes problematic.
From qualitative information about the market, results for inverse participation
ratios, temporal stability and style characteristics (discussed in sections below) and by an
analysis of the dimensionality of the SA market [29], a more realistic estimate seems to
be that 8-9 eigenvalues are associated with information content (~2% of the total).

Figure 5 (a) gives the distributions of component values for eigenvectors corresponding
to the 1st, 2nd, 3rd, 10th and 100th largest eigenvalues for the last epoch, 1998-2002. In
all three cases the first three distributions deviate significantly from the Porter-Thomas
null-hypothesis (Eqn. 4); the distributions corresponding to smaller eigenvalues are in
greater agreement with their random matrix counterparts. In Case 3, the components for
the leading eigenvector are mostly positive valued (as in [17], [27] and [28]).

Figure 5 (b) gives the inverse participation ratios (IPRs) plotted against corresponding
eigenvalues. In all three cases, the IPR for the leading eigenvector is approximately equal
to the RMT prediction of 3/N (Eqn. 5), indicating contributions from almost all stocks in
the market. This is consistent with findings in [28]. For Case 1, the IPRs for the 2nd and
3rd largest and the 7 smallest eigenvalues deviate significantly from the random matrix
null case. For Cases 2 and 3, the IPRs for the largest 6 and 9 eigenvalues, respectively,
deviate significantly from the random matrix null case, indicating contributions from
only a few stocks (as in [28]); the same is true for several of the smallest eigenvalues for
Case 2; for the rest of the eigenvalues in Cases 2 and 3, the mean IPR is greater than the
RMT prediction of 3/N and IPR values fluctuate with greater variance about this mean
compared with the variance for the null case.

Table 2
Average percentage of variance explained by the leading eigenvalues (average taken over 6
epochs).

                                        Case 1       Case 2       Case 3
Total variance
  % explained by eigenvalues 1-5        27% ± 5%     25% ± 4%     28% ± 4%
  % explained by eigenvalues 1-15       43% ± 5%     41% ± 4%     50% ± 4%
Trace of correlation matrix
  % explained by eigenvalues 1-5        11% ± 1%     12% ± 1%     15% ± 1%
  % explained by eigenvalues 1-15       18% ± 2%     20% ± 2%     25% ± 2%

3.4 Temporal stability of the correlation matrices 

Temporal stabilities of the matrices were investigated for annual variation. In [28], it was 
found that the largest four eigenvectors obtained from high-frequency data (returns at 
30-min intervals) were stable up to time-lags of 1 year, while the largest two eigenvectors 
from 30 years of daily data were stable for time-scales up to 20 years. 

In this investigation, we computed overlap matrices with entries given by estimated correlations
between the leading eigenvectors from the last 5-yr epoch, 1998-2002, with
their analogues from the preceding epochs, 1997-2001 to 1993-1997, as follows: for
each epoch, the eigenvectors corresponding to the 15 largest eigenvalues were chosen; each
eigenvector was expanded to include components for every share present in the epochs
being compared and when a share was not included in one of the epochs, a value of zero
was assigned ^. We let $U(E)$ denote the matrix whose 15 columns are the leading
15 eigenvectors from the $E$th epoch, $E = 6$ for 1998-2002, $E = 5$ for 1997-2001, etc.
The overlap matrices are hence given by $O(t, \tau) = U(t)^T U(t - \tau)$, where $t$ denotes an
epoch and $\tau$ denotes a lag in years.
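The construction of an overlap matrix can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: shares are identified by name, each epoch's eigenvectors are expanded onto the union of shares with zeros for absent shares, and the product of the expanded matrices is taken. All function names are hypothetical.

```python
import numpy as np

def leading_eigenvectors(corr, k=15):
    """Columns are the eigenvectors of `corr` for its k largest eigenvalues."""
    values, vectors = np.linalg.eigh(corr)
    order = np.argsort(values)[::-1][:k]
    return vectors[:, order]

def expand(vectors, names, universe):
    """Place eigenvector rows on a common share universe, zero-filling absent shares."""
    position = {name: i for i, name in enumerate(universe)}
    out = np.zeros((len(universe), vectors.shape[1]))
    for row, name in enumerate(names):
        out[position[name]] = vectors[row]
    return out

def overlap_matrix(u_now, names_now, u_lag, names_lag):
    """O = U(t)^T U(t - tau) after expanding both epochs to the union of shares."""
    universe = sorted(set(names_now) | set(names_lag))
    return expand(u_now, names_now, universe).T @ expand(u_lag, names_lag, universe)

# Toy example with 5 shares in the later epoch and 4 of them in the earlier one.
rng = np.random.default_rng(0)
corr = np.corrcoef(rng.standard_normal((5, 200)))
u = leading_eigenvectors(corr, k=3)
same_epoch = overlap_matrix(u, list("abcde"), u, list("abcde"))
lagged = overlap_matrix(u, list("abcde"), u[:4], list("abcd"))
```

At lag zero the overlap matrix reduces to the identity, since the leading eigenvectors of a symmetric matrix are orthonormal.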

Figure 6 gives an indication of the temporal stability of the eigenvectors associated with 
the largest eigenvalues for Cases 1 and 3. 

^ This takes into consideration the fact that each epoch comprises a slightly different
collection of shares.

Fig. 5. Daily price returns for JSE main board shares for years 1998-2002 are used to investigate
eigenvectors of three estimated correlation matrices. Figure 5 (a) gives the distributions of
component values for eigenvectors corresponding to the 1st, 2nd, 3rd, 10th and 100th largest
eigenvalues. Plots of the Porter-Thomas distribution (Eqn. 4) are superimposed. Figure 5 (b) gives
plots of the inverse participation ratios for the three cases together with the RMT prediction,
$E[I_\alpha] = 3/N$, using the kurtosis of Eqn. 4.

In Case 1 the correlation of the eigenvectors corresponding to the largest eigenvalue is 1 
for all lags; the correlation of the eigenvectors corresponding to the second eigenvalues is 
negative for lags 1 to 4 and the correlation of the eigenvectors corresponding to the third 
eigenvalues is positive for lags 1 to 4. 

In Case 3, temporal correlations of the eigenvectors corresponding to the largest eigenvalue
are alternately positive then negative for successive lags; correlations for the second
eigenvector are positive except for time lags of 2 and 5 years where there are no correlations;
correlations for the third eigenvector are insignificant at delays of 1 and 2 years, then
tend to be positive for delays of 3-5 years; correlations between eigenvectors corresponding
to the next seven eigenvalues tend to be positive for lags 1 to 5 years. The negative
correlations in Case 3 are consistent with the negative correlation between the financial
and resources sectors and the switch in performance of corresponding indices following
the market stresses in emerging markets in the 2nd half of 1998 (cf. Figure 1). This case
offers the greatest evidence of temporal stability. The overlap matrices are all computed
against Epoch number 6, 1998-2002, where the market is also influenced by the crash of
the Rand in 2001 (cf. ZAR/USD exchange rate inset in Figure 1).

Fig. 6. Overlap matrices are computed using 6 sets of 5-yr epochs of daily returns. Figure 6 (a)
and (b) give results for Cases 1 and 3, respectively. The graphs depict estimated correlations
between the 15 leading eigenvectors from the last epoch, 1998-2002, with their analogues from
the preceding epochs, 1998-2002 (lag 0), 1997-2001 (lag 1) to 1993-1997 (lag 5).



There is also evidence of stability for Case 2 - in this case the results are similar to but
slightly weaker than those for Case 3.

Fig. 7. Fundamental characterizations of components of the eigenvectors for Cases 1 and 3 for
1998-2002 are given in Figure 7 (a) and (b), respectively. Characteristic spectra are plotted in
increasing order: the lowest band represents the spectrum for the first eigenvector, the next
spectrum corresponds to the second eigenvector, etc. Characteristics used are: market capitalization,
volume traded, dividend yield, earnings per share. For each fundamental characteristic, values
were mapped to [0,1]; negative values for components (on the left) indicate short positions.



3.5 Interpretation of leading eigenvectors



Several studies have investigated market segmentation or clustering via metrics obtained 
from correlation matrices ([21], [15], [16] and [3]). In [28], the authors are able to interpret 
eigenvalues deviating from the RMT noiseband in a similar way. In that investigation, 
because it was an order of magnitude greater than the rest, the effect of the leading 
eigenvalue was removed from the data by regressing each stock return time series against
the leading eigenmode to obtain a new correlation matrix from stock specific return
components (as in the Capital Asset Pricing Model). The new leading eigenvector exhibited
significant contributions from about 1/3 of the 999 stocks, all with large values for market
capitalization. The next 9 eigenvectors all contained stocks belonging to distinct economic
sectors.

In this investigation, the leading eigenvalue for Case 3 was found to contribute much less
significantly to the overall trace compared to the rest of the large eigenvalues. Moreover, its
eigenvector components were not temporally stable, but displayed anti-correlations over
time as the behaviour of market participants changed through the 1997 Russian GKO
default, the 1998 emerging market contagion and the 2001 crash of the SA currency.
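
The regression step attributed to [28] above can be illustrated with a minimal sketch. It assumes a complete (T, N) array of returns with no missing values, which is not how Cases 1 to 3 of this paper treat the data, and the function name `remove_market_mode` and the synthetic one-factor data are hypothetical choices made only for illustration.

```python
import numpy as np

def remove_market_mode(returns):
    """Regress each stock's standardized returns on the leading eigenmode.

    returns: (T, N) array of returns with no missing values (an assumption;
    the paper's Cases 1-3 treat missing data differently).
    Returns the correlation matrix of the residuals and the original
    leading eigenvalue.
    """
    z = (returns - returns.mean(axis=0)) / returns.std(axis=0)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)     # eigenvalues in ascending order
    market = z @ eigvecs[:, -1]                 # leading eigenmode, length T
    beta = (z.T @ market) / (market @ market)   # OLS slope per stock (zero-mean data)
    residuals = z - np.outer(market, beta)      # stock-specific return components
    return np.corrcoef(residuals, rowvar=False), eigvals[-1]

# Synthetic data: one common factor plus idiosyncratic noise (illustrative only).
rng = np.random.default_rng(0)
common = rng.standard_normal((500, 1))
returns = common + rng.standard_normal((500, 20))
residual_corr, leading_before = remove_market_mode(returns)
leading_after = np.linalg.eigvalsh(residual_corr)[-1]
```

On data of this kind the leading eigenvalue of the residual correlation matrix is much smaller than the original one, which is what allows the remaining eigenvectors to be inspected without the market-wide mode dominating.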

Since the leading eigenvectors could not be identified with distinct economic sectors,
fundamental characteristics of each eigenvector were considered ^. Eigenvector components
were weighted according to fundamental properties and particular attention was paid to
eigenvectors corresponding to the largest eigenvalues. The fundamental properties
considered were: market capitalization, volume traded, dividend yield and earnings per share.
These variables were normalized and mapped to numbers between zero and unity.
Economic sectors were treated as a spectrum, ranging from resources, through industrial,
non-cyclical and then cyclical shares, into financial and then technology shares, as
classified by the JSE; sector values run from smaller to larger in this order and were
rescaled for the fundamental characteristic graphs.
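
As an illustration of the mapping and weighting just described, the following sketch scales a characteristic to [0, 1] and accumulates the long and short eigenvector weights over that scale. The min-max normalization, the binning, and the names `to_unit_interval` and `characteristic_profile` are assumptions; the text does not specify the exact procedure used to produce the figures.

```python
import numpy as np

def to_unit_interval(x):
    """Min-max map a characteristic to [0, 1] (assumed normalization)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def characteristic_profile(eigvec, characteristic, bins=10):
    """Sum long and short eigenvector weights in bins of a scaled characteristic.

    Returns (long_profile, short_profile), each of length `bins`.
    """
    eigvec = np.asarray(eigvec, dtype=float)
    scaled = to_unit_interval(characteristic)
    idx = np.minimum((scaled * bins).astype(int), bins - 1)  # bin index per stock
    long_w = np.where(eigvec > 0, eigvec, 0.0)    # positive components: long
    short_w = np.where(eigvec < 0, -eigvec, 0.0)  # negative components: short
    long_profile = np.bincount(idx, weights=long_w, minlength=bins)
    short_profile = np.bincount(idx, weights=short_w, minlength=bins)
    return long_profile, short_profile

# Hypothetical four-stock example: market capitalization as the characteristic.
eigvec = np.array([0.5, -0.5, 0.5, -0.5])
market_cap = np.array([1.0, 2.0, 3.0, 4.0])
long_profile, short_profile = characteristic_profile(eigvec, market_cap, bins=2)
```

Plotting the two profiles on either side of a common axis, one eigenvector per band, gives a picture similar in spirit to the spectra shown in the figures.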

This representation offers a method to inspect the differences between the compositions 
of the eigenmodes corresponding to different eigenvectors. 

Figure 7 gives representations of the fundamental characteristics for eigenvectors corre- 
sponding to eigenvalues ranging from largest (bottom) to smallest (top of range). We 
include results for Cases 1 and 3. 

This analysis, together with the tests of temporal stability, suggests that the eigenmodes ^
are not easily interpreted in terms of isolated characteristics. Instead, eigenmodes may be
viewed as being representative of distinct trading strategies prevalent in the market itself.
This conclusion is motivated by the observation that negative component values of the
eigenvectors imply shorting and the positive values imply long positions.
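
The reading of an eigenvector as a long/short portfolio can be sketched as follows. The eigenmode is taken here to be the projection of the return series onto the eigenvector, which is an assumption consistent with the footnote definition. Because the overall sign of an eigenvector is arbitrary, only the relative signs of its components carry meaning; the sketch fixes the sign so that the components sum to a non-negative value, which is an assumed convention. The ticker names are hypothetical.

```python
import numpy as np

def fix_sign(eigvec):
    """Flip the sign so that components sum to a non-negative value (assumed convention)."""
    eigvec = np.asarray(eigvec, dtype=float)
    return -eigvec if eigvec.sum() < 0 else eigvec

def eigenmode(returns, eigvec):
    """Time series of the portfolio whose weights are the eigenvector components."""
    return np.asarray(returns, dtype=float) @ eigvec

def long_short_legs(eigvec, tickers):
    """Split the components into long (positive) and short (negative) positions."""
    longs = {t: w for t, w in zip(tickers, eigvec) if w > 0}
    shorts = {t: w for t, w in zip(tickers, eigvec) if w < 0}
    return longs, shorts

tickers = ["AAA", "BBB", "CCC"]            # hypothetical stock codes
u = fix_sign([0.6, -0.8, 0.0])             # components sum to -0.2, so the sign is flipped
returns = np.array([[0.01, 0.02, 0.00],
                    [-0.01, 0.01, 0.03]])  # two days of returns for three stocks
mode = eigenmode(returns, u)
longs, shorts = long_short_legs(u, tickers)
```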

It can be deduced from Figure 7 that those eigenvectors which are associated with eigen- 
values lying in the noise band seem to correspond to trading strategies which hold roughly 
equal long and short positions in mid-capitalization stocks. Case 1 and Case 3 are quali- 
tatively the same in this sense for eigenvalues in the noise band. 

For Case 1, the first eigenvector is long in large capitalization, large volume resource and
industrial shares. The next leading eigenvector carries short positions in large capitalization,
large volume shares. None of the eigenvectors in this case shows a distinct economic
sector characterisation. This gives further evidence that there is information loss with
zero padding and zero-order hold.

For Case 3 of Epoch 6, the leading eigenvector has long positions similar to those in Case 1.
The second eigenvector is quite different from the first for Case 3, as well as from its
counterpart in Case 1: it exhibits long positions in stocks with smaller capitalization and
short positions in stocks with relatively larger capitalization; it is long in stocks with low
earnings, low volume and low dividend yield. In general the leading eigenvectors in Case
3 have more varied compositions.

^ Fundamental is used in the 'bottom up' sense of financial analysis, i.e. in terms of unique
characteristics of stocks such as earnings, dividends, market capitalization, book value, etc.
^ The eigenmode for each eigenvector is the time series derived from the time series of the
eigenvector components. Leading eigenmodes are also sometimes referred to as principal components.

SPECTRA OF THE LARGEST EIGENVECTOR FOR 6 EPOCHS
Fig. 8. Fundamental characteristics of components of the 1st eigenvectors for the 6 epochs for
Case 3, where Epochs 1 to 6 demarcate time windows 1993-1997 to 1998-2002 (cf. Figure 1).

Small eigenvalues for Case 3 correspond to trading strategies which replicate long and 
short positions in small capitalization, low earnings and lower volume and dividend yield 
relative to the noise band. Cases 1 and 3 deviate markedly in this regard. 

Similarly, inspection of the graph shows that characteristics for the rest of the eigenvectors
are different in Cases 1 and 3. In the former, the characteristics seem to settle to a noisy
composition sooner. This is consistent with findings for the inverse participation ratios.
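
For reference, the inverse participation ratio mentioned above can be computed as in the sketch below. It assumes the standard definition, the sum of the fourth powers of the components of a unit-normalized eigenvector; the function name is hypothetical.

```python
import numpy as np

def inverse_participation_ratio(eigvecs):
    """IPR of each column: sum of fourth powers of unit-normalized components."""
    eigvecs = np.asarray(eigvecs, dtype=float)
    eigvecs = eigvecs / np.linalg.norm(eigvecs, axis=0)  # normalize each column
    return np.sum(eigvecs ** 4, axis=0)

n = 100
uniform = np.full(n, 1.0)   # every stock participates equally
localized = np.zeros(n)
localized[0] = 1.0          # a single stock carries all the weight
ipr = inverse_participation_ratio(np.column_stack([uniform, localized]))
```

Under this definition a vector spread evenly over n stocks has an IPR of 1/n and a vector concentrated on one stock has an IPR of 1, so an eigenvector that settles to a noisy composition has an IPR near the lower end of that range.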

In Figure 8 we compare the fundamental characteristics for the leading eigenmode for the
different epochs. In Epoch 1, the 1st eigenmode is long in comparatively higher earnings,
smaller capitalization and low volume stocks, while it is short in lower earnings and large
capitalization stocks. In contrast, in Epoch 5 (1997-2001) and Epoch 6 (1998-2002), the
1st eigenmodes are dominated by long positions in large capitalization stocks and short
positions in small capitalization financial stocks, where the financial sector corresponds to
an economic sector value of 0.8. This depiction of a shift in the market is consistent with
the historical context observed in Figure 1.

In Figure 8 the short positions in Epochs 5 and 6 offer the only occurrences where aggregate
quantification of sector participation correlates uniquely with one sector. In general,
there is always significant share concentration in the resources sector (sector value 0.0) for
the period of investigation as well as varying activity in industrial (sector value 0.6) and
financial stocks. As a result, eigenmodes are not differentiated by sector participation and,
in particular, the single aggregate quantifier for sector participation of eigenmodes
considered in this investigation is generally ineffective. This is not surprising in an emerging
market, where stock concentration, currency volatility and generally high co-movement
with commodity prices blur economic sector independence. Inspection of these quantifiers
for the second eigenmodes corroborates these findings.



4 Conclusions 

Our investigation exposes some notable differences in the spectral properties of the
correlation matrices computed by the three different methods outlined. As in preceding analyses
of financial market data, in all cases we have found that (1) a significant part of the
eigenvalue spectrum falls within the range of random matrix predictions, and (2) there
exists a small number of large leading eigenvalues. However, we found that by computing
measured-data correlation matrices, Case 3, a far less substantial part of the spectrum
falls within the Wishart range (Eqn. 1) than when computing correlations with zero
padding, Case 1. Similar results were found when comparing the inverse participation
ratios of Cases 1 and 3 with their RMT counterparts. Results for Case 2, which
incorporated zero-order hold but not zero padding, varied across the RMT tests; here the
eigenvector components and inverse participation ratios were closest to RMT predictions.
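
The comparison against the Wishart range can be illustrated with a short sketch. Eqn. 1 is not reproduced in this section, so the bounds below assume the standard form for a correlation matrix of N series of length T with unit variance, namely (1 - sqrt(N/T))^2 and (1 + sqrt(N/T))^2. The function names and the pure-noise test data are hypothetical.

```python
import numpy as np

def wishart_bounds(T, N, sigma2=1.0):
    """Assumed standard noise-band edges for N series of length T."""
    ratio = N / T
    lower = sigma2 * (1.0 - np.sqrt(ratio)) ** 2
    upper = sigma2 * (1.0 + np.sqrt(ratio)) ** 2
    return lower, upper

def fraction_in_band(corr, T):
    """Fraction of eigenvalues of `corr` that lie inside the noise band."""
    eigvals = np.linalg.eigvalsh(corr)
    lower, upper = wishart_bounds(T, corr.shape[0])
    return float(np.mean((eigvals >= lower) & (eigvals <= upper)))

# Pure-noise example: nearly all eigenvalues should fall inside the band.
rng = np.random.default_rng(1)
T, N = 2000, 100
noise = rng.standard_normal((T, N))
frac = fraction_in_band(np.corrcoef(noise, rowvar=False), T)
```

Applying `fraction_in_band` to correlation matrices built under each case would give one number per case that can be compared directly, under the stated assumption about the form of the bounds.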

Our investigation suggests that zero padding and zero-order hold increase the level of
noise in the estimation of correlation matrices. The correlations between leading
eigenvectors from successive epochs also showed evidence of greater temporal stability when
measured-data correlations were used.
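
One simple way to quantify the stability of an eigenvector between successive epochs is the normalized inner product of the two component vectors, sketched below. This is an illustrative measure and not necessarily the one used in the paper; it assumes both vectors are indexed by the same stocks in the same order. The sign is kept so that anti-correlation shows up as a negative value, although the arbitrary overall sign of an eigenvector has to be fixed by some convention before such values are compared.

```python
import numpy as np

def eigenvector_overlap(u_prev, u_next):
    """Signed cosine similarity between eigenvectors from two epochs."""
    u_prev = np.asarray(u_prev, dtype=float)
    u_next = np.asarray(u_next, dtype=float)
    denom = np.linalg.norm(u_prev) * np.linalg.norm(u_next)
    return float(u_prev @ u_next / denom)

# Hypothetical leading eigenvectors for three stocks in two epochs.
epoch_a = np.array([0.6, 0.8, 0.0])
epoch_b = np.array([0.0, 0.0, 1.0])
same = eigenvector_overlap(epoch_a, epoch_a)       # identical composition
flipped = eigenvector_overlap(epoch_a, -epoch_a)   # fully anti-correlated
disjoint = eigenvector_overlap(epoch_a, epoch_b)   # no shared composition
```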

Our fundamental characteristic investigation suggests that the leading eigenmodes may 
be interpreted in terms of independent trading strategies with long range correlations. 
These are more distinct for the measured-data correlation case than when there was zero 
padding and zero-order hold. 
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